ODTrans: Fault Tolerant Transaction Protocols for the Cloud Data Store
CHENG Xu;LI Hongyan;WANG Tengjiao;YANG Dongqing
Acta Scientiarum Naturalium Universitatis Pekinensis    DOI: 10.13209/j.0479-8023.2015.011
EmBIOS: A BIOS Design for Embedded System Supporting MS Windows
LI Hao,ZHENG Yansong,PANG Jiufeng,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors present EmBIOS, a compatible BIOS design that lets embedded systems support desktop operating systems such as MS Windows. To achieve OS compatibility effectively, a simulator BIOS that can boot a desktop OS in a simulator environment is divided into multiple interrupt service routines. Then, by extending and transplanting these interrupt service routines to a traditional embedded firmware environment, EmBIOS enables initialization of the embedded system with existing firmware and provides the BIOS compatibility required by the desktop OS. Functional correctness and OS compatibility are verified by running Windows and its typical applications on the PKUnity86 FPGA and silicon. Experimental results demonstrate the portability of the EmBIOS design and its acceptable boot-up performance compared with a commercial embedded BIOS.
Application-Specific Graphical Caching Scheme for Thin-Client Computing
ZHANG Yang,GUAN Xuetao,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors investigate raw pixel redundancy caused by repainting an application's graphical objects and propose an application-specific graphical caching scheme to recognize and reduce this class of redundancy. The effectiveness of the scheme is demonstrated by an implementation in VNC, a frame-buffer-based thin-client system. The experimental results show that, for the tested scenarios, the scheme reduces network traffic by about 17.8%-22.7% and removes most of the high latencies caused by screen redundancy. Meanwhile, the scheme costs only a little additional computation and memory.
Extending Virtual Machine Memory with Hypervisor Exclusive Cache
NIU Yan,YANG Chun,XIA Yubin,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
It is hard to accurately predict the memory demand of a virtual machine. Moreover, it is not reliable to request other virtual machines to release memory. Under-provisioning of memory leads to severe performance degradation. To mitigate the impact, a hypervisor exclusive cache (HECache) is developed to extend the available memory of a virtual machine. A certain amount of memory is reserved as HECache in advance. Failed memory accesses in a VM are forwarded to HECache. All virtual machines running on the physical machine share HECache and can use it immediately. By each donating a little memory, all virtual machines can use more memory. Experiments conducted with both micro-benchmarks and real applications show that HECache can achieve up to 7.9 times better performance, and that the overhead is not significant compared with allocating the same amount of memory directly to a virtual machine. In addition, HECache is transparent to applications and is complementary to existing techniques such as ballooning, page-sharing, and hotplug.
A Comprehensive Study of Executing ahead Mechanism for In-Order Microprocessors
WANG Xiaoyin,TONG Dong,DANG Xianglei,LU Junlin,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors explore the design space of in-order executing ahead processors, and conduct sensitivity analysis of the executing ahead mechanism to the cache hierarchy and memory latency. It is demonstrated that reusing the pre-executed results is highly effective in improving performance and reducing energy consumption. The results also show that propagating valid data values between stores and dependent loads with a small store cache increases performance significantly. An in-order executing ahead processor with a 32-entry store cache and a 128-entry FIFO for preserving and reusing results increases performance by 24.07% over the baseline processor, with an energy overhead of 4.93%. Furthermore, it is revealed that executing ahead is necessary for hiding memory access latencies even with a very large cache hierarchy. With increasing memory latency, the performance and energy-efficiency benefits provided by executing ahead are more significant.
Standard-Cell-Based Temperature Sensor with Calibrated Supply Noise Tolerance
TIE Meng,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A standard-cell-based temperature sensor with calibrated tolerance for supply shift is proposed. Traditional digital-circuit temperature sensors suffer large errors from supply voltage shift because they are sensitive to supply voltage. Because the sensor is built purely from standard cells, it is easy to design with a normal digital circuit design flow. After 2-voltage calibration, the error caused by a supply shift of almost 0.1 V is reduced to 28.5℃, compared with 90℃ for the previously proposed dual-ring sensor.
A Basic-Block Reordering Algorithm Based on Neural Networks
ZHANG Jiyu,LIU Xianhua,LIANG Kun,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors present a basic-block reordering method that detects typical structures in the control-flow graph. It uses an architecture-specific branch cost model and the execution possibilities of control-flow edges to estimate the layout costs of specific sub-structures. The layout with the minimal estimated cost is chosen. The authors further investigate a novel approach that applies a neural network to predict the execution possibility of each edge. A set of programs is chosen to record particular static information about the edges in the typical structures. The data capture the relationship between static program features and dynamic behaviors, and are used to train an improved back propagation neural network (RPROP). The algorithm is implemented for a simple-pipeline UniCore microprocessor. Experimental results show that it improves programs' performance by about 8%, which indicates that the execution possibility of edges may be predicted using machine learning techniques.
Microarchitectural Design Space Exploration via Support Vector Machine
PANG Jiufeng,LI Xianfeng,XIE Jinsong,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors propose an approach to reduce the number of required simulations: simulate only sampled design points and use the results to construct informative and predictive support vector regression models. Having captured the interacting effects of design parameters, the models predict outputs for design points that are not simulated. The prediction time of a model is negligible compared with detailed simulation. The optimal design point determined by prediction is very close to that found by simulation for most applications, which provides an efficient way to cull a huge design space. Trained on only 0.26% of the design points, the models yield a mean relative prediction error as low as 0.52% for performance and 1.08% for power. Correlation analysis demonstrates that the prediction output is highly correlated with the simulated observation. The average squared correlation coefficient is 0.728 for performance models and 0.703 for power models, which implies that support vector regression captures most of the relationships among design parameters. The model also provides a predictive probability interval for each prediction, which is informative for computer architects.
Improvement of the Interactive Performance Isolation of Virtual Machines on Xen Platform
XIA Yubin,YANG Chun,NIU Yan,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors address the problem that, in a highly consolidated environment, the network latency of a guest OS shows continuous peaks. Three optimizations of the VM scheduler are designed and implemented to improve interactive performance isolation: cooperative preemption, preempt-back, and accurate accounting. None of these optimizations requires the guest OS to be modified. The evaluation results show that, with 8 computing-intensive VMs running concurrently, the average of the top 5% network latency of the other 8 VMs is reduced to as low as 0.93% of the original, and that of web-mail browsing with Firefox is reduced to 56.1%.
Analysis and Practice of a SoC Hardware Kernel for MS Windows
ZHENG Yansong,TONG Dong,LI Hao,PANG Jiufeng,WANG Keyi,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors study a method for developing a SoC hardware kernel for MS Windows. The method captures the basic system function specification of the hardware kernel through multiple simulation executions and gradual drawoff, on the premise that the system remains MS Windows compatible. The experiment indicates that the complexity of the hardware kernel is drastically lower than that of the whole system, and that the hardware kernel requirements differ noticeably among MS Windows versions. Moreover, the SoC hardware kernel for MS Windows 98 is verified on an FPGA prototype.
CacheCompress: A Novel Approach for Test Compression for IP Cores
FANG Hao,SONG Xiaodi,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A novel test data compression technique named CacheCompress is proposed. Unlike previous static-dictionary-based techniques, its dictionary is dynamic. During testing, the dictionary is accessed by read and write operations and only needs to keep the most frequently used data, which greatly decreases the memory size requirement and eliminates the explicit dictionary initialization step. Experiments show that CacheCompress achieves a 30% higher compression ratio than other recent compression schemes, while the dictionary size is dramatically reduced to 1‰.
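The idea of a dictionary that is filled by reads and writes during decoding, rather than loaded up front, can be sketched as follows. This is only an illustration under stated assumptions, not the CacheCompress scheme itself: the function names, the token format, and the move-to-front replacement policy are all hypothetical choices made for the example.

```python
def encode(words, capacity=4):
    """Encode a word stream against a small dynamic dictionary.

    Assumption: the dictionary is a move-to-front list of recent words.
    """
    cache, tokens = [], []
    for word in words:
        if word in cache:
            index = cache.index(word)
            tokens.append(("hit", index))        # short reference to a cached word
            cache.insert(0, cache.pop(index))    # refresh: move to front
        else:
            tokens.append(("raw", word))         # word sent uncompressed
            cache.insert(0, word)                # write it into the dictionary
            del cache[capacity:]                 # evict the oldest entries
    return tokens


def decode(tokens, capacity=4):
    """Rebuild the stream; the dictionary starts empty and mirrors the encoder."""
    cache, words = [], []
    for kind, value in tokens:
        word = cache.pop(value) if kind == "hit" else value
        cache.insert(0, word)
        del cache[capacity:]
        words.append(word)
    return words
```

Because the decoder applies the same updates as the encoder, no dictionary contents need to be transmitted or initialized before decoding starts.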
RiTLB: iTLB Design Based on Memory Region Reusing
XIE Jinsong,TONG Dong,LI Xianfeng,PANG Jiufeng,WANG Keyi,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
To design an iTLB based on memory region reusing, the number of comparison bits in a lookup is first reduced through memory region encoding, which replaces the higher-order bits of the VPN with a much shorter memory region ID before the VPN is sent to the iTLB. Second, the memory region ID is reused until execution switches to the next memory region. Experimental results show that, compared with the baseline iTLB, the average dynamic power, delay, and area of the new design decrease by 62.84%, 9.96%, and 44.78% respectively, with only a 0.23% average IPC reduction.
A Profit-Driven Algorithm for Semantic Code Motion
NIE Jiutao,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The potential reasons for the negative effects of aggressive code motion are analyzed. The authors build a profit model and propose a profit-driven semantic code motion algorithm, which determines whether an existing result should be reused. The new algorithm is implemented in GCC-4.2.0. Experimental results from SPEC2000 on an X86 machine show that the code generated by GCC using the new algorithm is on average 6.8% faster than that using semantic code motion and 2.6% faster than that using GCC's original code motion algorithm, GVNPRE.
Maximum Power Analysis Based on Bayesian Inference and Vector Compression Techniques
CHEN Jie,LI Xianfeng,TONG Dong,WANG Keyi,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
To address the excessive time consumed by simulation-based maximum power analysis, a Bayesian power model based on slice analysis is proposed. The model selects the input vector set that may generate maximum power and performs accurate power estimation on the resulting compact sequence. The relationship between signal switching density and maximum power generation is analyzed, and an input vector generation platform with self-adaptive switching density computation and Bayesian vector compression is then proposed. The experimental results indicate that the Bayesian vector compression method yields an average estimation speed-up of 1005 times, with an average maximum-power error of 2.40%. When vector generation based on self-adaptive computation and Bayesian vector compression is used, the lower bound on maximum power is raised by 1.99%, and the average speed-up reaches 163 times.
Clock Skew Scheduling for Area Optimization
WANG Kui,DONG Haiying,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A new clock skew scheduling algorithm is proposed. This algorithm generates timing constraints that effectively promote area optimization in logic synthesis. During clock skew scheduling, the slacks are not assigned equally to the arcs in critical cycles. Instead, they are assigned according to arc weights, which are calculated from the area impact of the corresponding paths. Experimental results show that this approach efficiently reduces the area of logic synthesis results compared with the traditional clock skew scheduling algorithm, without degrading performance.
An Arbitration Approach of Efficient Bandwidth Allocation and Low Latency for SoC Communication
LU Junlin,LIU Dan,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A novel arbitration approach for SoC communication is presented. It provides fine-grained bandwidth allocation based on dynamically updated records of communication status. The arbiters in NoC routers, multi-port DRAM controllers, and shared buses can adopt this approach to improve system performance. The proposed approach is evaluated with a metric called bandwidth shortage, which reflects how close the actual bandwidth allocation is to an optimal one. Experimental results reveal that this arbitration approach reduces the bandwidth shortage by 13% and shortens the communication latency by 37.5%. Furthermore, the results of hardware implementation show that it is efficient in area and timing for large- and medium-scale SoC designs.
A Fast Hierarchical Multi-Objective Mapping Approach for Mesh-Based Networks-on-Chip
LIN Hua,ZHANG Liang,TONG Dong,LI Xianfeng,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The authors propose a fast hierarchical multi-objective mapping approach (HMMap) for mesh-based NoC. Based on partitioning and multi-objective heuristic techniques, HMMap automatically maps a large number of IP cores onto the NoC architecture and makes good tradeoffs between communication energy and latency. Experimental results show that the proposed approach achieves shorter execution time, lower energy, and lower latency than other approaches. As the NoC size increases, the optimization effect of HMMap becomes more pronounced.
An Age Encoding Based Bloom Filter Algorithm for Load-Store Queue Energy Reduction
ZHAO Yulai,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
Load-store order violations and load-load order violations are considered in multithreaded or multiprocessor systems, and the counter-based Bloom filter algorithm is improved by eliminating false positives through age encoding. The filtering ratio is improved by over 5% with no impact on pipeline timing or performance.
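For readers unfamiliar with the baseline, a counter-based Bloom filter can be sketched as below. This shows only the generic structure that the abstract starts from, not the age-encoding improvement; the class name, the table size, and the use of SHA-256 as the hash are assumptions for illustration (a hardware filter would use much simpler address hashing).

```python
import hashlib


class CountingBloomFilter:
    """Generic counter-based Bloom filter over addresses (illustrative only)."""

    def __init__(self, size=64, hashes=2):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _indexes(self, addr):
        # Derive `hashes` counter indexes from the address (assumed hash choice).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def insert(self, addr):
        for index in self._indexes(addr):
            self.counters[index] += 1

    def remove(self, addr):
        # Counters (rather than single bits) are what make removal possible.
        for index in self._indexes(addr):
            self.counters[index] -= 1

    def may_contain(self, addr):
        # False means "definitely absent"; True may be a false positive.
        return all(self.counters[index] > 0 for index in self._indexes(addr))
```

A negative answer lets a queue search be skipped entirely, while a false positive only costs an unnecessary search, which is why reducing false positives raises the filtering ratio.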
SSDC: A Split Data Cache Design for Sequential Access Intensive Applications
LIU Shu,GOU Xiaogang,QU Ning,LI Xianfeng,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
Caches are widely used to reduce the speed gap between processors and memories. However, the spatial locality of the sequential data accesses found in many popular applications is not well exploited by conventional data caches. In response to this problem, the Split Sequential Data Cache (SSDC) is proposed, in which a sequential access detector predicts whether data accesses are sequential and directs them to the right sub-cache. Experiments show that the SSDC outperforms the conventional data cache and other schemes. It reduces the miss rate of applications with intensive sequential data accesses with only a small increase in bandwidth requirement. Meanwhile, the experimental results on SPEC2000Int show that SSDC does not hurt the performance of applications without large sequential accesses.
A Low-Leakage Pipelined Instruction Cache Design
SUN Hanxin,WANG Xiaoyin,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The pipelined level-one instruction cache (PIL1) has been proposed to improve instruction fetch bandwidth in high-frequency processors. However, few studies in the literature have focused on reducing the leakage power of PIL1. Here, the authors observe that the PIL1 structure naturally provides inherent opportunities for saving leakage power. Based on this observation, the authors propose to manage cache line activity according to the demand of the fetch address, activating only the requested line and keeping the others in low-voltage mode, thereby saving leakage power effectively. Simulation results demonstrate that PIL1 leakage power is reduced by an average of 77.3%. Meanwhile, the performance degradation is only 0.32%, and no timing overhead is induced.
A Semi-Centralized Computing Model for Network Computer Systems
YANG Chun,XIA Yubin,NIU Yan,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A semi-centralized computing model is proposed for network computer systems. It enables clients to take on more of the computing workload and provides a seamless operating experience for users, while preserving all the management advantages of traditional network computer systems. The authors investigate strategies for computing partition, input integration, and display integration, and implement a prototype video player based on the semi-centralized computing model. Experimental results show that it can provide seamless video playback and reduce server load dramatically.
Power-Aware Gated Clock Routing with Merging Cost Backward Annotation Using Simulated Annealing Method
DUAN Lian,XU Hu,WANG Kui,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
Traditional clock routing algorithms can be extended to embrace clock gating by merging the node pairs with minimum switching capacitance in the bottom-up phase. However, optimizing switching capacitance at the nodes currently being merged affects their ancestors' gating chances, which may worsen power consumption. A zero-skew gated clock routing algorithm is proposed to solve this problem. It reduces the total switching capacitance by evaluating the merging cost of this effect using the result derived from the clock tree generated in the previous round. Because the result needs to be optimized over iterations, the algorithm employs a simulated annealing technique. At each iteration, the clock tree is reconstructed using back-annotated merging cost information, and new constraints are generated for optimization in the next round. Experimental results show that this algorithm can achieve up to 23% power reduction compared with the traditional Greedy-DME algorithm.
Hierarchical Network-on-Chip Design Method
WANG Hongwei,LU Junlin,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
With the development of VLSI technology and the increasing complexity of System-on-Chip applications, on-chip communication architecture design encounters problems in throughput, power, signal integrity, latency, and clock synchronization; Network-on-Chip (NoC) was introduced to address them. Given the specific patterns of on-chip communication, designing a hierarchical Network-on-Chip is of great significance for improving communication performance and reducing hardware cost. This paper puts forward a hierarchical NoC design method. According to technology and application requirements, researchers can generate several IP core subsets (“clusters”) and design a NoC architecture that matches the inter-cluster communication requirements. Experiments show that the hierarchical NoC design method can efficiently improve system performance, decrease hardware cost, and meet Quality-of-Service requirements at the same time.
CMOS Combinational Circuit Leakage Power Reduction Using Genetic Algorit
ZHAO Xiaoying,YI Jiangfang,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A leakage power reduction platform for CMOS combinational circuits based on input vector control is presented. A genetic algorithm is used to search for the minimum leakage vector, with circuit status difference as the fitness function. Experimental results show that this circuit-status-difference-based genetic algorithm achieves satisfactory leakage power reduction with reasonable runtime. The method does not require HSpice simulation and is independent of the target technology library.
Characterizing the d-TLB Behavior of Typical Applications on Network Computer
QU Ning,YUAN Peng,GUAN Xuetao,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A network computer is an interactive device in a thin-client computing environment, and studying the behavior of typical applications on this platform is important to microprocessor design and system development. Based on the PKUnity network computer platform, this paper analyzes the d-TLB miss rate and performance penalty of many typical applications under different d-TLB structures and page sizes. The experimental results explain the advantage of the TLB design in the PKUnity SoC, which satisfies the requirements of low power and low complexity.
GATEST: A Validation Platform of Automatic Simulation Vectors Generation Using Genetic Algorithms
YI Jiangfang,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
Simulation-based validation approaches need a large number of simulation vectors to verify the corner cases of VLSI designs. The authors developed a validation platform for RT-level designs that automatically generates simulation vectors with a genetic algorithm (GA), based on the path coverage metric. Given the critical signals, it uses data flow analysis to acquire the critical path set and takes critical path coverage as the fitness function of the GA. The authors performed experiments on some functional modules of the Unity-863 SoC. The relationship between the final results and the control factors was also analyzed in detail. The results show that GATEST is effective and efficient.
Design Features of a High Throughput RSA Cryptoprocessor
LIU Qiang,MA Fangzhen,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
The Montgomery multiplication algorithm is optimized for large-bit modular multiplication and VLSI implementation. It is combined with the R-L (Right to Left) binary method to improve speed. Special effort is focused on the problems of long-bit modular arithmetic. A Carry-Save-Adder architecture, implemented with redesigned (4:2) compressors, is used in the multiplier to avoid long carry propagation. A signal-backup strategy is used to resolve the problem of signal broadcasting. Using a multiplexer-based method, the datapath of the multiplier is reconfigurable to perform either one 1024-bit multiplication or two 512-bit multiplications in parallel. The Chinese Remainder Theorem (CRT) increases the decryption data rate by a factor of 3.8.
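As a software-level illustration of the two algorithms named in the abstract, the sketch below combines textbook Montgomery multiplication with right-to-left binary exponentiation. It is a minimal sketch under stated assumptions, not the paper's hardware design: the function names are hypothetical, the modulus must be odd, and Python 3.8+ is assumed for the modular inverse via `pow(n, -1, r)`.

```python
def montgomery_setup(n):
    """Precompute constants for an odd modulus n (hypothetical helper)."""
    bits = n.bit_length()
    r = 1 << bits                       # r = 2**bits > n
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime == -1 (mod r)
    return bits, r, n_prime


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * r^-1 mod n using multiplies, masks, and shifts only."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by r
    u = (t + m * n) >> bits
    return u - n if u >= n else u       # single conditional subtraction


def mont_pow_rl(base, exp, n):
    """Right-to-left binary exponentiation carried out in Montgomery form."""
    bits, r, n_prime = montgomery_setup(n)
    x = (base % n) * r % n              # base converted to Montgomery form
    acc = r % n                         # 1 in Montgomery form
    while exp:
        if exp & 1:                     # scan exponent bits from the right
            acc = mont_mul(acc, x, n, bits, n_prime)
        x = mont_mul(x, x, n, bits, n_prime)
        exp >>= 1
    return mont_mul(acc, 1, n, bits, n_prime)   # convert back out
```

In the right-to-left method the squaring and the conditional multiply in each iteration do not depend on each other, which is one reason the method suits parallel hardware multipliers.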
The Effect of Periodic Disturbance to the Hierarchical Structure in Turbulent Boundary Layer
CHENG Xueling,HU Fei
Acta Scientiarum Naturalium Universitatis Pekinensis   
Artificial periodic disturbances are introduced to the outer field of a turbulent boundary layer in a closed-circuit open water channel. A statistical method is employed to analyze the velocity fluctuation time series. The effect of the disturbance on the turbulent structure in the boundary layer is studied. The result indicates that the She-Leveque hierarchical similarity exists in high-frequency turbulence.
RSA Cryptoprocessor Based on a Redesigned Systolic Array
LIU Qiang,MA Fangzhen,TONG Dong,CHENG Xu
Acta Scientiarum Naturalium Universitatis Pekinensis   
A novel and generic approach is presented for the hardware implementation of an RSA cryptoprocessor in deep submicron (DSM) technology with a redesigned systolic array. With deep submicron technology scaling, the integrated circuit performance bottleneck has shifted from logic gates to global interconnection. Besides using the systolic architecture, which is popular in hardware-based RSA systems, a block-based scheme is proposed to eliminate global signals, with a pipelined bus to convey data globally. The control signals and intermediate results used for sequential multiplications are transmitted by shift registers. All signals, except for the clock signal, are confined to one block or to two adjacent blocks. The Chinese Remainder Theorem (CRT) technique increases the decryption data rate by a factor of four. Two redundant blocks are added to accommodate the online partition of the multiplier and the variation in the lengths of P and Q in CRT mode. The block-based global signal transportation scheme and the redundancy scheme are quite different from those of previous works.
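The CRT speed-up mentioned in the abstract comes from replacing one full-size exponentiation modulo N = P·Q with two half-size exponentiations modulo P and Q. The following is a minimal sketch of standard RSA-CRT decryption with Garner's recombination, assuming Python 3.8+; the function name and the toy key in the usage note are illustrative and are not taken from the paper.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Decrypt ciphertext c with private exponent d using the CRT.

    p and q are the secret primes of the modulus n = p * q.
    """
    dp = d % (p - 1)                 # reduced exponent for the mod-p half
    dq = d % (q - 1)                 # reduced exponent for the mod-q half
    q_inv = pow(q, -1, p)            # q^-1 mod p, precomputable per key
    m1 = pow(c, dp, p)               # half-size exponentiation mod p
    m2 = pow(c, dq, q)               # half-size exponentiation mod q
    h = (q_inv * (m1 - m2)) % p      # Garner's recombination step
    return m2 + h * q                # unique result in the range [0, n)
```

With the toy key p = 61, q = 53, e = 17, d = 2753, a message encrypted as `pow(m, 17, 3233)` is recovered by `rsa_crt_decrypt(c, 61, 53, 2753)`. The two half-size exponentiations are independent, so hardware can run them in parallel or on a datapath split into two halves.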